Data Cleaning and Analysis of Q Cup 2018 Coffee Ratings using Python¶
A project for course "Data Exploration using Python" (STA 4243) at the University of Texas at San Antonio (UTSA)
Participants: Robert Hall, Dulce Ximena Cid Sanabria, Ryan Berberek, Max Moran
Table of Contents¶
- Preface and Data Dictionary
- Data and Library Importation
- Exploratory Data Analysis & Cleaning
- Quality Category Questions
- What palate-related variable has the highest correlation with score?
- Are there statistically significant differences in palate-related quantifications with respect to different countries of origin?
- Is there a correlation between altitude and certain taste quantifications?
- Location Metadata Questions
- What countries have the highest overall score?
- Which countries produce the most coffee?
- What location-related variable has the highest correlation with score?
- Bean Metadata Questions
- Which processing method is used the most?
- Which processing method has the highest number of bags?
- Did any product get graded the same year it was harvested?
- Q Grader Metadata Questions
- What metrics from Q Graders have any significant effect on score?
- Is this category useful?
Preface¶
Midterm Project Dataset Choice: Coffee¶
We chose the coffee dataset because of our group’s history with and enjoyment of the drink. We also considered how we might be biased when analyzing other options.
Our group approached the Coffee Quality Union's 2018 dataset with a focus on exploratory analysis, answering the profiling questions from our preliminary report. Most of these questions correlate variables with final score, while some relate individual Quality category variables to Location Metadata such as Country or Region. We visualized and described our results to outline the answers to these questions.
Data Dictionary¶
We separated variables into the categories defined in the original "Coffee Quality Database" by James LeDoux, along with some distinctions of our own.
| Quality | Description |
|---|---|
| Aftertaste | The length of positive flavor (taste and aroma) qualities emanating from the back of the palate and remaining after the coffee is expectorated or swallowed. If the aftertaste were short or unpleasant, a lower score would be given. |
| Aroma | The smell of the coffee when infused with hot water. |
| Acidity | Acidity is often described as "brightness" when favorable or "sour" when unfavorable. The final score marked on the horizontal tick-mark scale should reflect the panelist's perceived quality for the Acidity relative to the expected flavor profile based on origin characteristics and/or other factors (degree of roast, intended use, etc.). Coffees expected to be high in Acidity, such as a Kenya coffee, or coffees expected to be low in Acidity, such as a Sumatra coffee, can receive equally high preference scores although their intensity rankings will be quite different. |
| Body | The tactile feeling, or mouthfeel, of the liquid in the mouth, especially as perceived between the tongue and the roof of the mouth. |
| Balance | How the various aspects of flavor, aftertaste, acidity, and body work together and complement or contrast with each other. |
| Clean Cup | The lack of interfering negative impressions from first ingestion to final aftertaste, a "transparency" of cup. |
| Uniformity | Uniformity refers to consistency of flavor of the different cups of the sample tasted. If the cups taste different, the rating of this aspect would not be as high. |
| Sweetness | A pleasing fullness of flavor, as well as any obvious sweetness. |
| Moisture | Moisture content of the bean sample, recorded as a proportion (e.g., 0.12 = 12%). |
| Bean Metadata | Description |
|---|---|
| Processing Method | Method for processing the beans. |
| Color | Color of the beans. |
| Species | Either Arabica or Robusta. |
| Variety | Variety of the beans. |
| Quakers | Count of Quakers in the beans sample. Quakers are under-developed coffee beans that can impact the flavor of coffee, as they look the same as regular beans. They can taste dry, ashy, bitter, astringent, and similar to burned popcorn. |
| Category One Defects | "Category 1" defects are considered primary defects that significantly impact the taste of the coffee, like fully black beans, full sour beans, or pods. |
| Category Two Defects | "Category 2" defects are secondary defects that have a less severe impact on taste, including things like broken beans, small sticks, or partial black beans. |
| # of Bags | Number of bags tested. |
| Bag Weight | Bag weight tested. |
| Grading Date | When the beans were graded. |
| Expiration Date | Expiration date of the beans. |
| Location Metadata | Description |
|---|---|
| Owner | Who owns the beans. |
| Country | Where the beans came from. |
| Region | Region where beans came from. |
| Company | Name of the company that managed the beans. |
| Farm | Name of the farm that grew the beans. |
| Producer | Producer of the roasted bean. |
| Lot | Lot number of the tested beans. |
| Mill | Mill where the beans were processed. |
| Altitude (Low, High, Mean) | Altitude low, high, and mean meters. |
| Q Grader Metadata | Description |
|---|---|
| Certification Body | Who certified the sample grade. |
| ICO Number | International Coffee Organization number. |
| Unit of Measurement | Unit of measurement. |
Data and Library Importation¶
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import re
coffee = pd.read_csv('coffee_ratings.csv')
coffee.head()
| total_cup_points | species | owner | country_of_origin | farm_name | lot_number | mill | ico_number | company | altitude | ... | color | category_two_defects | expiration | certification_body | certification_address | certification_contact | unit_of_measurement | altitude_low_meters | altitude_high_meters | altitude_mean_meters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90.58 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 0 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
| 1 | 89.92 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 1 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
| 2 | 89.75 | Arabica | grounds for health admin | Guatemala | san marcos barrancas "san cristobal cuch | NaN | NaN | NaN | NaN | 1600 - 1800 m | ... | NaN | 0 | May 31st, 2011 | Specialty Coffee Association | 36d0d00a3724338ba7937c52a378d085f2172daa | 0878a7d4b9d35ddbf0fe2ce69a2062cceb45a660 | m | 1600.0 | 1800.0 | 1700.0 |
| 3 | 89.00 | Arabica | yidnekachew dabessa | Ethiopia | yidnekachew dabessa coffee plantation | NaN | wolensu | NaN | yidnekachew debessa coffee plantation | 1800-2200 | ... | Green | 2 | March 25th, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1800.0 | 2200.0 | 2000.0 |
| 4 | 88.83 | Arabica | metad plc | Ethiopia | metad plc | NaN | metad plc | 2014/2015 | metad agricultural developmet plc | 1950-2200 | ... | Green | 2 | April 3rd, 2016 | METAD Agricultural Development plc | 309fcf77415a3661ae83e027f7e5f05dad786e44 | 19fef5a731de2db57d16da10287413f5f99bc2dd | m | 1950.0 | 2200.0 | 2075.0 |
5 rows × 43 columns
Exploratory Data Analysis¶
coffee.dtypes
total_cup_points float64 species object owner object country_of_origin object farm_name object lot_number object mill object ico_number object company object altitude object region object producer object number_of_bags int64 bag_weight object in_country_partner object harvest_year object grading_date object owner_1 object variety object processing_method object aroma float64 flavor float64 aftertaste float64 acidity float64 body float64 balance float64 uniformity float64 clean_cup float64 sweetness float64 cupper_points float64 moisture float64 category_one_defects int64 quakers float64 color object category_two_defects int64 expiration object certification_body object certification_address object certification_contact object unit_of_measurement object altitude_low_meters float64 altitude_high_meters float64 altitude_mean_meters float64 dtype: object
coffee.columns
Index(['total_cup_points', 'species', 'owner', 'country_of_origin',
'farm_name', 'lot_number', 'mill', 'ico_number', 'company', 'altitude',
'region', 'producer', 'number_of_bags', 'bag_weight',
'in_country_partner', 'harvest_year', 'grading_date', 'owner_1',
'variety', 'processing_method', 'aroma', 'flavor', 'aftertaste',
'acidity', 'body', 'balance', 'uniformity', 'clean_cup', 'sweetness',
'cupper_points', 'moisture', 'category_one_defects', 'quakers', 'color',
'category_two_defects', 'expiration', 'certification_body',
'certification_address', 'certification_contact', 'unit_of_measurement',
'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
dtype='object')
for col in coffee.columns:
    if pd.api.types.is_numeric_dtype(coffee[col]):
        print(f"Column {col} Maximum: {coffee[col].max()}")
        print(f"Column {col} Minimum: {coffee[col].min()}")
        print('\n')
Column total_cup_points Maximum: 90.58 Column total_cup_points Minimum: 0.0 Column number_of_bags Maximum: 1062 Column number_of_bags Minimum: 0 Column aroma Maximum: 8.75 Column aroma Minimum: 0.0 Column flavor Maximum: 8.83 Column flavor Minimum: 0.0 Column aftertaste Maximum: 8.67 Column aftertaste Minimum: 0.0 Column acidity Maximum: 8.75 Column acidity Minimum: 0.0 Column body Maximum: 8.58 Column body Minimum: 0.0 Column balance Maximum: 8.75 Column balance Minimum: 0.0 Column uniformity Maximum: 10.0 Column uniformity Minimum: 0.0 Column clean_cup Maximum: 10.0 Column clean_cup Minimum: 0.0 Column sweetness Maximum: 10.0 Column sweetness Minimum: 0.0 Column cupper_points Maximum: 10.0 Column cupper_points Minimum: 0.0 Column moisture Maximum: 0.28 Column moisture Minimum: 0.0 Column category_one_defects Maximum: 63 Column category_one_defects Minimum: 0 Column quakers Maximum: 11.0 Column quakers Minimum: 0.0 Column category_two_defects Maximum: 55 Column category_two_defects Minimum: 0 Column altitude_low_meters Maximum: 190164.0 Column altitude_low_meters Minimum: 1.0 Column altitude_high_meters Maximum: 190164.0 Column altitude_high_meters Minimum: 1.0 Column altitude_mean_meters Maximum: 190164.0 Column altitude_mean_meters Minimum: 1.0
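The min/max scan above surfaces some implausible values (e.g., altitudes of 190,164 m and total scores of 0). A minimal sketch of flagging such rows with pandas; the thresholds below are our own assumptions, not part of the dataset's documentation:

```python
import pandas as pd

def flag_outliers(df, col, lo, hi):
    """Return the rows of df where col falls outside [lo, hi]."""
    return df[(df[col] < lo) | (df[col] > hi)]

# Toy frame standing in for the coffee data
demo = pd.DataFrame({'altitude_mean_meters': [1200.0, 2075.0, 190164.0, 1.0]})

# Coffee is rarely grown above ~4,000 m or below ~100 m
suspect = flag_outliers(demo, 'altitude_mean_meters', 100, 4000)
print(suspect)
```

Rows flagged this way can then be inspected by hand before deciding whether to drop or correct them.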
print(coffee['species'].value_counts())
species Arabica 1311 Robusta 28 Name: count, dtype: int64
What palate-related variable has the highest correlation with score?¶
Features:¶
- Aftertaste
- Aroma
- Acidity
- Body
- Balance
- Clean Cup
- Uniformity
- Sweetness
- Moisture
coffee.columns
Index(['total_cup_points', 'species', 'owner', 'country_of_origin',
'farm_name', 'lot_number', 'mill', 'ico_number', 'company', 'altitude',
'region', 'producer', 'number_of_bags', 'bag_weight',
'in_country_partner', 'harvest_year', 'grading_date', 'owner_1',
'variety', 'processing_method', 'aroma', 'flavor', 'aftertaste',
'acidity', 'body', 'balance', 'uniformity', 'clean_cup', 'sweetness',
'cupper_points', 'moisture', 'category_one_defects', 'quakers', 'color',
'category_two_defects', 'expiration', 'certification_body',
'certification_address', 'certification_contact', 'unit_of_measurement',
'altitude_low_meters', 'altitude_high_meters', 'altitude_mean_meters'],
dtype='object')
coffee_salient = coffee[['total_cup_points', 'aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']]
coffee_salient.head()
| total_cup_points | aftertaste | aroma | acidity | body | balance | clean_cup | uniformity | sweetness | moisture | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90.58 | 8.67 | 8.67 | 8.75 | 8.50 | 8.42 | 10.0 | 10.0 | 10.0 | 0.12 |
| 1 | 89.92 | 8.50 | 8.75 | 8.58 | 8.42 | 8.42 | 10.0 | 10.0 | 10.0 | 0.12 |
| 2 | 89.75 | 8.42 | 8.42 | 8.42 | 8.33 | 8.42 | 10.0 | 10.0 | 10.0 | 0.00 |
| 3 | 89.00 | 8.42 | 8.17 | 8.42 | 8.50 | 8.25 | 10.0 | 10.0 | 10.0 | 0.11 |
| 4 | 88.83 | 8.25 | 8.25 | 8.50 | 8.42 | 8.33 | 10.0 | 10.0 | 10.0 | 0.12 |
coffee_salient.isnull().sum()
total_cup_points 0 aftertaste 0 aroma 0 acidity 0 body 0 balance 0 clean_cup 0 uniformity 0 sweetness 0 moisture 0 dtype: int64
features = coffee_salient[['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']]
labels = coffee_salient[['total_cup_points']]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, labels, train_size=0.8, test_size=0.2)
from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
model = mlr.fit(x_train, y_train)
points_pred = model.predict(x_test)
print(features.columns)
Index(['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup',
'uniformity', 'sweetness', 'moisture'],
dtype='object')
print(model.coef_)
[[ 1.96186763 1.36946563 1.35736291 1.02358272 1.28359187 1.04780508 1.03838505 0.97678615 -0.09006529]]
coefs = []
for subset in model.coef_:
for coef in subset:
coefs.append(round(coef, 4))
print(coefs)
cols = [col for col in features.columns]
print(cols)
feature_coefs = pd.DataFrame({'features': cols,
'coefficients': coefs})
# note: sort_values returns a copy; feature_coefs itself keeps its insertion order
feature_coefs.sort_values('coefficients', ascending=False)
feature_coefs
[1.9619, 1.3695, 1.3574, 1.0236, 1.2836, 1.0478, 1.0384, 0.9768, -0.0901] ['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']
| features | coefficients | |
|---|---|---|
| 0 | aftertaste | 1.9619 |
| 1 | aroma | 1.3695 |
| 2 | acidity | 1.3574 |
| 3 | body | 1.0236 |
| 4 | balance | 1.2836 |
| 5 | clean_cup | 1.0478 |
| 6 | uniformity | 1.0384 |
| 7 | sweetness | 0.9768 |
| 8 | moisture | -0.0901 |
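Since the question asks about correlation, the regression coefficients above can be cross-checked against plain Pearson correlations. A minimal sketch using a few rows from the head of the dataset shown earlier (in the notebook one would pass the full `coffee_salient` frame instead of this stand-in):

```python
import pandas as pd

# Stand-in for coffee_salient, using values from the head() output above
demo = pd.DataFrame({
    'total_cup_points': [90.58, 89.92, 89.75, 89.00, 88.83],
    'aftertaste':       [8.67, 8.50, 8.42, 8.42, 8.25],
    'moisture':         [0.12, 0.12, 0.00, 0.11, 0.12],
})

# Pearson correlation of each palate variable with the total score, strongest first
corr_with_score = (demo.corr()['total_cup_points']
                       .drop('total_cup_points')
                       .sort_values(ascending=False))
print(corr_with_score)
```

On these five rows 'aftertaste' correlates strongly with score while 'moisture' does not, consistent with the coefficient table.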
Are there statistically significant differences in palate-related quantifications with respect to different countries of origin?¶
To keep the analysis manageable, only six of the many surveyed nations were included in this comparison. Countries were not randomly selected; they were chosen for geographical representation.
coffee['country_of_origin'].value_counts()
country_of_origin Mexico 236 Colombia 183 Guatemala 181 Brazil 132 Taiwan 75 United States (Hawaii) 73 Honduras 53 Costa Rica 51 Ethiopia 44 Tanzania, United Republic Of 40 Uganda 36 Thailand 32 Nicaragua 26 Kenya 25 El Salvador 21 Indonesia 20 China 16 India 14 Malawi 11 United States 10 Peru 10 Myanmar 8 Vietnam 8 Haiti 6 Philippines 5 United States (Puerto Rico) 4 Panama 4 Ecuador 3 Laos 3 Burundi 2 Papua New Guinea 1 Rwanda 1 Zambia 1 Japan 1 Mauritius 1 Cote d?Ivoire 1 Name: count, dtype: int64
top8 = coffee[coffee['country_of_origin'].isin(['Mexico',
'Brazil',
'Taiwan',
'Ethiopia',
'Tanzania, United Republic Of',
'Indonesia'])]
# note: replace() returns a copy; without reassignment this rename does not persist,
# which is why the long name still appears in the output below
top8['country_of_origin'].replace('Tanzania, United Republic Of', 'Tanzania')
top8['country_of_origin'].value_counts()
country_of_origin Mexico 236 Brazil 132 Taiwan 75 Ethiopia 44 Tanzania, United Republic Of 40 Indonesia 20 Name: count, dtype: int64
from statsmodels.stats.multicomp import pairwise_tukeyhsd
sig = 0.05 # significance threshold
tukey_results = pairwise_tukeyhsd(top8['aftertaste'], top8['country_of_origin'], sig)
print(tukey_results)
# under the “reject” column, if True, then there is a significant difference; if False, no significant difference.
Multiple Comparison of Means - Tukey HSD, FWER=0.05 ============================================================================= group1 group2 meandiff p-adj lower upper reject ----------------------------------------------------------------------------- Brazil Ethiopia 0.4533 0.0 0.3038 0.6028 True Brazil Indonesia -0.025 0.9993 -0.2311 0.181 False Brazil Mexico -0.2125 0.0 -0.3058 -0.1191 True Brazil Taiwan -0.0451 0.905 -0.1692 0.0791 False Brazil Tanzania, United Republic Of -0.0155 0.9997 -0.1705 0.1395 False Ethiopia Indonesia -0.4784 0.0 -0.7099 -0.2468 True Ethiopia Mexico -0.6658 0.0 -0.8068 -0.5248 True Ethiopia Taiwan -0.4984 0.0 -0.6615 -0.3353 True Ethiopia Tanzania, United Republic Of -0.4689 0.0 -0.6565 -0.2813 True Indonesia Mexico -0.1874 0.0808 -0.3874 0.0125 False Indonesia Taiwan -0.02 0.9998 -0.2361 0.1961 False Indonesia Tanzania, United Republic Of 0.0095 1.0 -0.2257 0.2447 False Mexico Taiwan 0.1674 0.0004 0.0536 0.2812 True Mexico Tanzania, United Republic Of 0.1969 0.0019 0.0501 0.3438 True Taiwan Tanzania, United Republic Of 0.0295 0.9961 -0.1386 0.1977 False -----------------------------------------------------------------------------
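In the table above, `meandiff` is mean(group2) − mean(group1), so a positive value with `reject=True` means group2 scored significantly higher. A minimal sketch verifying the sign convention with plain group means (toy numbers, not the real data):

```python
import pandas as pd

# Toy aftertaste scores for two origins; Ethiopia is deliberately higher
demo = pd.DataFrame({
    'country': ['Brazil'] * 3 + ['Ethiopia'] * 3,
    'aftertaste': [7.5, 7.6, 7.4, 8.0, 8.1, 7.9],
})

means = demo.groupby('country')['aftertaste'].mean()
# Tukey's meandiff for (group1=Brazil, group2=Ethiopia):
meandiff = means['Ethiopia'] - means['Brazil']
print(round(meandiff, 2))  # positive, so Ethiopia's mean is higher
```

This is how the pairwise rows in the Tukey output translate into "which country scored higher".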
from statsmodels.stats.multicomp import pairwise_tukeyhsd
sig = 0.05 # significance threshold
tukey_results = pairwise_tukeyhsd(top8['acidity'], top8['country_of_origin'], sig)
print(tukey_results)
# under the “reject” column, if True, then there is a significant difference; if False, no significant difference.
Multiple Comparison of Means - Tukey HSD, FWER=0.05 ============================================================================= group1 group2 meandiff p-adj lower upper reject ----------------------------------------------------------------------------- Brazil Ethiopia 0.5322 0.0 0.3957 0.6687 True Brazil Indonesia -0.0204 0.9996 -0.2087 0.1678 False Brazil Mexico -0.0831 0.061 -0.1683 0.0022 False Brazil Taiwan -0.096 0.1511 -0.2094 0.0174 False Brazil Tanzania, United Republic Of -0.0114 0.9999 -0.153 0.1301 False Ethiopia Indonesia -0.5526 0.0 -0.7642 -0.3411 True Ethiopia Mexico -0.6153 0.0 -0.7441 -0.4865 True Ethiopia Taiwan -0.6282 0.0 -0.7771 -0.4792 True Ethiopia Tanzania, United Republic Of -0.5436 0.0 -0.715 -0.3723 True Indonesia Mexico -0.0627 0.9239 -0.2453 0.12 False Indonesia Taiwan -0.0755 0.8836 -0.2729 0.1219 False Indonesia Tanzania, United Republic Of 0.009 1.0 -0.2058 0.2238 False Mexico Taiwan -0.0129 0.9993 -0.1169 0.0911 False Mexico Tanzania, United Republic Of 0.0717 0.6464 -0.0625 0.2058 False Taiwan Tanzania, United Republic Of 0.0845 0.616 -0.069 0.2381 False -----------------------------------------------------------------------------
Do there exist correlations between altitude and certain taste quantifications?¶
Data on altitude was extremely messy: hyphen styles and unit placement varied widely, though all measurements appear to be in meters. The new altitude column takes the minimum (low end) of each stated range, both for ease of cleaning and to keep the estimates conservative.
# note: dropna returns a copy; this call has no effect without reassignment
coffee.dropna(subset=['altitude'])
# keep only leading four-digit altitudes; everything else becomes missing
coffee['altitude'] = coffee['altitude'].apply(lambda x: str(x)[:4] if str(x)[:4].isdigit() else None)
coffee['altitude'] = pd.to_numeric(coffee['altitude'], errors='coerce').astype('Int64')
coffee['altitude'].head()
0 1950 1 1950 2 1600 3 1800 4 1950 Name: altitude, dtype: Int64
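The slice-based parsing above works for four-digit altitudes but silently drops three-digit values such as "950 m". A sketch of a more robust alternative using `str.extract`, which pulls the first number (the low bound of a stated range) regardless of width; the sample strings below are illustrative:

```python
import pandas as pd

raw = pd.Series(['1950-2200', '1600 - 1800 m', '950 m', 'unknown', None])

# First integer in the string = low end of the stated range
alt_low = pd.to_numeric(raw.str.extract(r'(\d+)', expand=False),
                        errors='coerce').astype('Int64')
print(alt_low.tolist())
```

This would recover low-altitude farms that the four-character slice discards.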
coffee_salient = coffee[['altitude', 'aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']]
feature = coffee['altitude']
labels = coffee[['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup', 'uniformity', 'sweetness', 'moisture']]
Since the data has missing values, we fill the 'altitude' column with its mean.
from sklearn.impute import SimpleImputer
import numpy as np
x_train, x_test, y_train, y_test = train_test_split(feature, labels, test_size=0.2, train_size=0.8, random_state=27)
imputer = SimpleImputer(strategy='mean')
x_train = np.array(x_train).reshape(-1,1)
x_train = imputer.fit_transform(x_train)
x_test = np.array(x_test).reshape(-1,1)
x_test = imputer.transform(x_test)  # transform only: reuse the training mean to avoid leakage
mlr = LinearRegression()
model = mlr.fit(x_train, y_train)
points_pred = model.predict(x_test)
labels.columns
Index(['aftertaste', 'aroma', 'acidity', 'body', 'balance', 'clean_cup',
'uniformity', 'sweetness', 'moisture'],
dtype='object')
model.coef_
array([[-2.12529437e-05],
[ 1.66793663e-05],
[ 3.94667029e-05],
[-7.61694694e-06],
[ 8.81035253e-08],
[ 3.20848528e-05],
[ 7.51252524e-06],
[-1.67944937e-06],
[ 1.29207158e-06]])
coefs = []
for subset in model.coef_:
for coef in subset:
coefs.append(coef)
#print(coefs)
cols = list(labels.columns)  # use the current targets, not the earlier 'features' frame
#print(cols)
feature_coefs = pd.DataFrame({'features': cols,
'coefficients': coefs})
# note: sort_values returns a copy; feature_coefs itself keeps its insertion order
feature_coefs.sort_values('coefficients', ascending=False)
feature_coefs
| features | coefficients | |
|---|---|---|
| 0 | aftertaste | -2.125294e-05 |
| 1 | aroma | 1.667937e-05 |
| 2 | acidity | 3.946670e-05 |
| 3 | body | -7.616947e-06 |
| 4 | balance | 8.810353e-08 |
| 5 | clean_cup | 3.208485e-05 |
| 6 | uniformity | 7.512525e-06 |
| 7 | sweetness | -1.679449e-06 |
| 8 | moisture | 1.292072e-06 |
Overall Takeaways and Answers¶
What palate-related variable has the highest correlation with score?¶
- 'aftertaste' has the strongest association with score: its linear regression coefficient is approximately 1.96, the largest among the palate variables.
Are there statistically significant differences in taste quantifications with respect to different countries of origin?¶
Statistically significant differences were found between the scores of the following nations under the aforementioned categories:
(Bold text indicates the nation(s) with significantly higher scores than their counterpart nations in their respective categories)
aftertaste:
- Brazil and **Ethiopia**
- **Brazil** and Mexico
- **Ethiopia** and Indonesia, Mexico, Taiwan and Tanzania
- Mexico and **Taiwan, Tanzania**
acidity:
- Brazil and **Ethiopia**
- **Ethiopia** and Indonesia, Mexico, Taiwan and Tanzania
A significance threshold of α = 0.05 was used. Tukey's HSD (honestly significant difference) range test was applied to obtain the adjusted p-values.
Do there exist strong correlations between altitude and certain taste quantifications?¶
The coefficients output by the model are quite small, largely because altitude is measured in thousands of meters while scores lie near 8-10; without feature scaling the slopes are tiny, and multicollinearity may also play a role. Both topics lie beyond the scope of this course.
In relative terms, altitude puts downward pressure on the scores of 'aftertaste' and 'body', and upward pressure on 'balance', 'uniformity', 'acidity', and 'clean_cup' (the absence of interfering negative impressions in the cup).
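To illustrate the scale effect, standardizing the predictor expresses the slope in "score points per standard deviation of altitude" instead of per meter. A minimal sketch on synthetic data (the 0.0004 slope and noise level are invented for the demonstration; z-scoring is done by hand to keep the example dependency-light):

```python
import numpy as np

rng = np.random.default_rng(0)
altitude = rng.uniform(800, 2200, size=200)                  # meters
acidity = 7.5 + 0.0004 * altitude + rng.normal(0, 0.1, 200)  # synthetic scores

# Ordinary least-squares slope on raw vs. z-scored altitude
z = (altitude - altitude.mean()) / altitude.std()
slope_raw = np.polyfit(altitude, acidity, 1)[0]
slope_std = np.polyfit(z, acidity, 1)[0]
print(slope_raw, slope_std)  # slope_std = slope_raw * std(altitude)
```

The standardized slope is hundreds of times larger, which is why the raw coefficients above look negligible even when the relationship is real.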
Location Metadata Questions¶
coffee.rename(columns={'owner': 'Owner', 'producer': 'Producer', 'country_of_origin': 'Country', 'region': 'Region', 'company': 'Company',
'farm_name': 'Farm', 'lot': 'Lot', 'mill': 'Mill', 'altitude_mean_meters': 'Altitude (mean)','total_cup_points':'Total Cup Points',
'number_of_bags':'Number of Bags','altitude_low_meters':'Altitude (low)', 'altitude_high_meters':'Altitude (high)'}, inplace=True)
What countries have the highest overall score?¶
# Group the data by country and calculate the mean of Total Cup Points
country_points = coffee.groupby('Country')['Total Cup Points'].mean().reset_index()
# Create a Choropleth map using Plotly
cup_map = px.choropleth(  # avoid shadowing the built-in map()
    country_points,
    locations='Country',
    locationmode='country names',
    color='Total Cup Points',
    hover_name='Country',
    color_continuous_scale=px.colors.diverging.Spectral,
    labels={'Total Cup Points': 'Average Cup Points'})
# Show the choropleth map
cup_map.show()
coffee.groupby('Country')['Total Cup Points'].mean().reset_index().sort_values(by='Total Cup Points', ascending=False)
| Country | Total Cup Points | |
|---|---|---|
| 23 | Papua New Guinea | 85.750000 |
| 8 | Ethiopia | 85.484091 |
| 14 | Japan | 84.670000 |
| 31 | United States | 84.433000 |
| 15 | Kenya | 84.309600 |
| 22 | Panama | 83.707500 |
| 30 | Uganda | 83.451944 |
| 3 | Colombia | 83.106557 |
| 7 | El Salvador | 83.052857 |
| 2 | China | 82.927500 |
| 26 | Rwanda | 82.830000 |
| 4 | Costa Rica | 82.789020 |
| 29 | Thailand | 82.573750 |
| 13 | Indonesia | 82.565500 |
| 24 | Peru | 82.526000 |
| 0 | Brazil | 82.405909 |
| 28 | Tanzania, United Republic Of | 82.369500 |
| 27 | Taiwan | 82.001333 |
| 35 | Zambia | 81.920000 |
| 9 | Guatemala | 81.846575 |
| 16 | Laos | 81.833333 |
| 1 | Burundi | 81.830000 |
| 32 | United States (Hawaii) | 81.820411 |
| 33 | United States (Puerto Rico) | 81.727500 |
| 17 | Malawi | 81.711818 |
| 34 | Vietnam | 81.208750 |
| 12 | India | 81.082857 |
| 19 | Mexico | 80.890085 |
| 25 | Philippines | 80.834000 |
| 20 | Myanmar | 80.750000 |
| 18 | Mauritius | 80.500000 |
| 21 | Nicaragua | 80.458077 |
| 6 | Ecuador | 80.220000 |
| 11 | Honduras | 79.357547 |
| 5 | Cote d?Ivoire | 79.330000 |
| 10 | Haiti | 77.180000 |
The country of origin with the highest overall score is Papua New Guinea with 85.75 points, followed by Ethiopia with 85.48 points and Japan with 84.67 points. On the other hand, Haiti has the lowest overall score in the coffee database.
Which countries produce the most coffee? (country, number of bags)¶
coffee.groupby('Country')['Number of Bags'].sum().reset_index().sort_values(by='Number of Bags', ascending=False)
| Country | Number of Bags | |
|---|---|---|
| 3 | Colombia | 41204 |
| 9 | Guatemala | 36868 |
| 0 | Brazil | 30534 |
| 19 | Mexico | 24140 |
| 11 | Honduras | 13167 |
| 8 | Ethiopia | 11761 |
| 4 | Costa Rica | 10354 |
| 21 | Nicaragua | 6406 |
| 30 | Uganda | 5477 |
| 7 | El Salvador | 4449 |
| 15 | Kenya | 3971 |
| 28 | Tanzania, United Republic Of | 3760 |
| 12 | India | 3011 |
| 24 | Peru | 2336 |
| 27 | Taiwan | 1914 |
| 13 | Indonesia | 1658 |
| 29 | Thailand | 1310 |
| 32 | United States (Hawaii) | 833 |
| 17 | Malawi | 557 |
| 22 | Panama | 537 |
| 1 | Burundi | 520 |
| 31 | United States | 462 |
| 10 | Haiti | 390 |
| 25 | Philippines | 259 |
| 26 | Rwanda | 150 |
| 16 | Laos | 81 |
| 33 | United States (Puerto Rico) | 71 |
| 2 | China | 55 |
| 14 | Japan | 20 |
| 35 | Zambia | 13 |
| 34 | Vietnam | 11 |
| 20 | Myanmar | 10 |
| 23 | Papua New Guinea | 7 |
| 6 | Ecuador | 3 |
| 5 | Cote d?Ivoire | 2 |
| 18 | Mauritius | 1 |
The country that produces the most coffee is Colombia with 41,204 bags, followed by Guatemala with 36,868 bags and Brazil with 30,534 bags. On the other hand, Mauritius has the fewest bags (1) in the coffee database.
What location-related variable has the highest correlation with score?¶
numeric_data = coffee[['Altitude (mean)', 'Altitude (low)', 'Altitude (high)'] + ['Total Cup Points']].dropna()
location_correlation = numeric_data.corr().drop('Total Cup Points')
# Correlation heatmap
sns.heatmap(location_correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Map (Location variables)')
plt.show()
Taken in aggregate, the correlation between altitude and total cup points indicates almost no relationship between these variables. Next, we compute the correlation separately for each country of origin.
# Group by Country: correlation between mean altitude and total cup points
# (selecting the two columns before apply avoids operating on the grouping column)
country_altitude_corr = coffee.groupby('Country')[['Altitude (mean)', 'Total Cup Points']].apply(
    lambda group: group.corr().iloc[0, 1]).dropna().sort_values(ascending=False).reset_index()
country_altitude_corr.rename(columns={0:'Correlation'})
| Country | Correlation | |
|---|---|---|
| 0 | United States (Hawaii) | 1.000000 |
| 1 | Peru | 1.000000 |
| 2 | Ecuador | 1.000000 |
| 3 | Burundi | 1.000000 |
| 4 | Haiti | 0.742130 |
| 5 | Philippines | 0.726317 |
| 6 | India | 0.587434 |
| 7 | Uganda | 0.478961 |
| 8 | Ethiopia | 0.415867 |
| 9 | Taiwan | 0.399207 |
| 10 | Malawi | 0.399116 |
| 11 | Thailand | 0.381836 |
| 12 | United States | 0.274804 |
| 13 | Costa Rica | 0.235791 |
| 14 | Panama | 0.191214 |
| 15 | El Salvador | 0.125397 |
| 16 | Kenya | 0.112667 |
| 17 | Honduras | 0.048521 |
| 18 | Mexico | 0.042369 |
| 19 | Nicaragua | 0.037648 |
| 20 | Brazil | 0.027320 |
| 21 | Colombia | -0.006585 |
| 22 | Guatemala | -0.053725 |
| 23 | Tanzania, United Republic Of | -0.080831 |
| 24 | China | -0.107649 |
| 25 | Vietnam | -0.141472 |
| 26 | Myanmar | -0.150707 |
| 27 | Laos | -0.154993 |
| 28 | Indonesia | -0.565402 |
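The perfect 1.0 correlations at the top of the table are a symptom of countries with very few usable rows: any two points correlate perfectly. A sketch of requiring a minimum group size before trusting a per-country correlation (the threshold of 5 and the toy data are our own choices):

```python
import pandas as pd

demo = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'Altitude (mean)': [1000, 2000, 1000, 1200, 1400, 1600, 1800, 2000],
    'Total Cup Points': [80.0, 85.0, 81.0, 80.5, 82.0, 81.5, 83.0, 82.5],
})

sizes = demo.groupby('Country').size()
keep = sizes[sizes >= 5].index          # drop countries with fewer than 5 samples
filtered = demo[demo['Country'].isin(keep)]
corr = filtered.groupby('Country')[['Altitude (mean)', 'Total Cup Points']] \
               .apply(lambda g: g.corr().iloc[0, 1])
print(corr)
```

Country "A" (two rows, correlation exactly 1.0) is excluded, leaving only correlations estimated from a meaningful number of samples.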
Altitude tends to have a greater effect on coffee quality in certain origins (for example, Ethiopia or Uganda), but the relationship varies considerably by location, and several of the perfect correlations above come from countries with very few samples.
Which farms, mills and producers have the most consistent coffee quality over time?¶
# Calculate the standard deviation of 'total_cup_points' for each farm
farm_consistency = coffee.groupby('Farm')['Total Cup Points'].std().dropna().sort_values().reset_index()
farm_consistency.head()
| Farm | Total Cup Points | |
|---|---|---|
| 0 | gran manzana y el aguacate | 0.113137 |
| 1 | el barbaro | 0.113137 |
| 2 | gao chun fang 高醇坊 | 0.120208 |
| 3 | coopetarrazú | 0.120208 |
| 4 | la orduña, coatepec, veracruz | 0.120208 |
The farms with the most consistent coffee quality are Gran Manzana y El Aguacate and El Barbaro, both with the lowest standard deviation of 0.113137.
# Calculate the standard deviation of 'total_cup_points' for each mill
mill_consistency = coffee.groupby('Mill')['Total Cup Points'].std().dropna().sort_values().reset_index()
mill_consistency.head()
| Mill | Total Cup Points | |
|---|---|---|
| 0 | ibero | 0.000000 |
| 1 | el olvido, zentla, veracruz | 0.000000 |
| 2 | internacional armazens gerais | 0.056569 |
| 3 | 17/18 | 0.063640 |
| 4 | santa adelaida | 0.113137 |
The most consistent mills are Ibero and El Olvido, Zentla, Veracruz, both with a standard deviation of 0.
# Calculate the standard deviation of 'total_cup_points' for each producer
producer_consistency = coffee.groupby('Producer')['Total Cup Points'].std().dropna().sort_values().reset_index()
producer_consistency.head()
| Producer | Total Cup Points | |
|---|---|---|
| 0 | Santos Fonseca | 0.000000 |
| 1 | Maria Rogeria Costa Pereira | 0.000000 |
| 2 | Assefa Belay Coffee Producer | 0.056569 |
| 3 | COFFEE FOR PEACE,INC. | 0.113137 |
| 4 | El Barbaro, S.A. de C.V. | 0.113137 |
The producers with the most consistent coffee quality are Santos Fonseca and Maria Rogeria Costa Pereira, both with a standard deviation of 0.
These farms, mills, and producers with the lowest standard deviations suggest stable growing conditions, reliable practices, and consistent coffee quality, though groups with only a few graded lots can show artificially low standard deviations.
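Because a standard deviation computed from one or two lots says little about real consistency, a minimum-count filter makes the ranking more trustworthy. A minimal sketch with toy data (the `Farm` and `Total Cup Points` column names follow this notebook; the values are illustrative):

```python
import pandas as pd

# Toy stand-in for the coffee table's Farm / Total Cup Points columns
df = pd.DataFrame({
    'Farm': ['a', 'a', 'a', 'b', 'b', 'c'],
    'Total Cup Points': [84.0, 84.5, 83.5, 80.0, 80.0, 86.0],
})

# std() over a single observation is NaN, and two identical scores give
# std 0, so require a minimum number of lots before calling a farm
# "consistent".
stats = df.groupby('Farm')['Total Cup Points'].agg(['std', 'count'])
consistent = stats[stats['count'] >= 3].sort_values('std')
print(consistent)
```

Only farm `a` survives the filter here; farms `b` and `c` have too few lots for a meaningful spread.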
Bean Metadata Questions¶
Which processing method is used the most?¶
pm_counts = coffee['processing_method'].value_counts().reset_index()
print(pm_counts)
pm_bar = px.bar(pm_counts, x='processing_method', y='count')
pm_bar.show()
            processing_method  count
0                Washed / Wet    815
1               Natural / Dry    258
2   Semi-washed / Semi-pulped     56
3                       Other     26
4      Pulped natural / honey     14
In broad percentages, Wet processing makes up 69.7%, Dry about 22.1%, Semi-Wet about 4.8%, Other about 2.2%, and Honey about 1.2%.
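These shares can be computed directly rather than by hand. A minimal sketch using a synthetic Series whose counts mirror the output above (`value_counts(normalize=True)` returns proportions):

```python
import pandas as pd

# Hypothetical stand-in for coffee['processing_method'] with the same counts
pm = pd.Series(
    ['Washed / Wet'] * 815 + ['Natural / Dry'] * 258
    + ['Semi-washed / Semi-pulped'] * 56 + ['Other'] * 26
    + ['Pulped natural / honey'] * 14
)

# normalize=True yields each method's share of the total directly
shares = (pm.value_counts(normalize=True) * 100).round(1)
print(shares)
```

This avoids manual division and keeps the percentages in sync if the underlying counts change.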
Which processing method has the highest number of bags?¶
pm_bags = coffee.groupby('processing_method')['Number of Bags'].sum().reset_index().sort_values(by='Number of Bags', ascending=False)
print(pm_bags)
pm_bags_bar = px.bar(pm_bags, x='processing_method', y='Number of Bags')
pm_bags_bar.show()
            processing_method  Number of Bags
4                Washed / Wet          128071
0               Natural / Dry           40581
3   Semi-washed / Semi-pulped            5304
1                       Other            3342
2      Pulped natural / honey            2341
The Wet processing method accounts for over 71% of the individual bags processed, with Dry second at about 22.6%. The remaining methods account for only about 6% of bags.
Did any product get graded the same year it was harvested?¶
import re

# Clean grading_date column, which is by default un-parseable by to_datetime
def remove_date_suffix(column):
    return re.sub(r"\b([0123]?[0-9])(st|nd|rd|th)\b", r"\1", column)

coffee['grading_date'] = coffee['grading_date'].apply(remove_date_suffix)
coffee['grading_date'] = pd.to_datetime(coffee['grading_date'], format='mixed')
# Extract the grading year and convert it to a string for comparison
coffee['year'] = coffee['grading_date'].dt.year.astype(str)
# Preliminary analysis of harvest vs. grading year: compare element-wise
# (Series.equals() would compare the whole columns and return a single bool)
coffee['harvest_vs_grading_years'] = coffee['harvest_year'] == coffee['year']
year_comparison = coffee['harvest_vs_grading_years'].value_counts().reset_index()
print(year_comparison)
   harvest_vs_grading_years  count
0                     False   1339
Without deep-cleaning the messy harvest_year column, there are no direct matches between the extracted grading year and the harvest year. Values in nonstandard formats likely hide observations whose harvest and grading years do match; further data cleaning is needed.
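One way the deep-cleaning could proceed is to pull every 4-digit year out of each messy harvest_year string and check whether the grading year appears among them. A minimal sketch with hypothetical sample values (the real column's formats may differ):

```python
import pandas as pd

# Illustrative examples of messy harvest_year formats
harvest = pd.Series(['2017', '2017 / 2018', 'March 2018', '23 July 2017', None])
grading_year = pd.Series(['2018', '2018', '2018', '2017', '2018'])

# Extract all 4-digit years from each harvest string (NaN where missing)
years = harvest.str.findall(r'(?:19|20)\d{2}')

# Flag rows whose grading year appears among the extracted harvest years
match = [
    g in ys if isinstance(ys, list) else False
    for g, ys in zip(grading_year, years)
]
print(match)  # [False, True, True, True, False]
```

Ranges like "2017 / 2018" then count as a match if either year equals the grading year, which seems closer to the question's intent than exact string equality.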
Q Grader Metadata Questions¶
duplicated_coffee = coffee[coffee.duplicated()]
print("Duplicate rows: ", duplicated_coffee.shape)
Duplicate rows: (0, 45)
Grab relevant columns¶
coffee.rename(columns={'Total Cup Points': 'total_cup_points'}, inplace=True)
qBasedCoffee = coffee[['total_cup_points','certification_body']]
qBasedCoffee.head()
| total_cup_points | certification_body | |
|---|---|---|
| 0 | 90.58 | METAD Agricultural Development plc |
| 1 | 89.92 | METAD Agricultural Development plc |
| 2 | 89.75 | Specialty Coffee Association |
| 3 | 89.00 | METAD Agricultural Development plc |
| 4 | 88.83 | METAD Agricultural Development plc |
Visualization¶
Let's display a basic chart to better visualize the distribution of certification bodies and total cup points.
# Average total cup points per certification body
avg_points = qBasedCoffee.groupby('certification_body')['total_cup_points'].mean()
plt.figure(figsize=(20, 6))
plt.bar(avg_points.index, avg_points.values, width=0.4)
plt.xlabel("Certification Bodies")
plt.ylabel("Total Cup Points")
plt.title("Average Cup Points per Certification Body")
plt.xticks(rotation=90)
plt.show()
Visual Inspection¶
You can see perhaps a slight downward slope, but nothing striking. Let's check by fitting a linear regression model.
Basic Linear Model¶
from statsmodels.formula.api import ols
fit = ols('total_cup_points ~ C(certification_body)', data=qBasedCoffee).fit()
fit.summary()
| Dep. Variable: | total_cup_points | R-squared: | 0.122 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.105 |
| Method: | Least Squares | F-statistic: | 7.277 |
| Date: | Tue, 15 Oct 2024 | Prob (F-statistic): | 7.08e-24 |
| Time: | 01:30:20 | Log-Likelihood: | -3490.3 |
| No. Observations: | 1339 | AIC: | 7033. |
| Df Residuals: | 1313 | BIC: | 7168. |
| Df Model: | 25 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 80.8227 | 0.231 | 349.424 | 0.000 | 80.369 | 81.276 |
| C(certification_body)[T.Africa Fine Coffee Association] | 1.5265 | 0.527 | 2.899 | 0.004 | 0.493 | 2.560 |
| C(certification_body)[T.Almacafé] | 2.5233 | 0.339 | 7.437 | 0.000 | 1.858 | 3.189 |
| C(certification_body)[T.Asociacion Nacional Del Café] | 0.8298 | 0.353 | 2.354 | 0.019 | 0.138 | 1.521 |
| C(certification_body)[T.Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C.] | 1.8573 | 1.372 | 1.354 | 0.176 | -0.834 | 4.548 |
| C(certification_body)[T.Asociación de Cafés Especiales de Nicaragua] | -0.6465 | 1.194 | -0.542 | 0.588 | -2.988 | 1.695 |
| C(certification_body)[T.Blossom Valley International] | 1.3947 | 0.493 | 2.832 | 0.005 | 0.428 | 2.361 |
| C(certification_body)[T.Blossom Valley International ] | -0.4027 | 3.320 | -0.121 | 0.903 | -6.915 | 6.110 |
| C(certification_body)[T.Brazil Specialty Coffee Association] | 0.9834 | 0.466 | 2.110 | 0.035 | 0.069 | 1.898 |
| C(certification_body)[T.Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao] | 2.3473 | 3.320 | 0.707 | 0.480 | -4.165 | 8.860 |
| C(certification_body)[T.Centro Agroecológico del Café A.C.] | 1.5223 | 1.194 | 1.275 | 0.202 | -0.819 | 3.864 |
| C(certification_body)[T.Coffee Quality Institute] | -0.0013 | 1.273 | -0.001 | 0.999 | -2.498 | 2.496 |
| C(certification_body)[T.Ethiopia Commodity Exchange] | 4.3573 | 0.814 | 5.352 | 0.000 | 2.760 | 5.954 |
| C(certification_body)[T.Instituto Hondureño del Café] | -1.6487 | 0.486 | -3.392 | 0.001 | -2.602 | -0.695 |
| C(certification_body)[T.Kenya Coffee Traders Association] | 2.9491 | 0.743 | 3.969 | 0.000 | 1.492 | 4.407 |
| C(certification_body)[T.METAD Agricultural Development plc] | 5.8433 | 0.886 | 6.596 | 0.000 | 4.105 | 7.581 |
| C(certification_body)[T.NUCOFFEE] | 2.5617 | 0.598 | 4.280 | 0.000 | 1.388 | 3.736 |
| C(certification_body)[T.Salvadoran Coffee Council] | 1.6564 | 1.025 | 1.616 | 0.106 | -0.354 | 3.667 |
| C(certification_body)[T.Specialty Coffee Association] | 1.1459 | 0.298 | 3.851 | 0.000 | 0.562 | 1.730 |
| C(certification_body)[T.Specialty Coffee Association of Costa Rica] | 1.7159 | 0.555 | 3.089 | 0.002 | 0.626 | 2.806 |
| C(certification_body)[T.Specialty Coffee Association of Indonesia] | 1.7823 | 1.073 | 1.662 | 0.097 | -0.322 | 3.886 |
| C(certification_body)[T.Specialty Coffee Institute of Asia] | 3.5160 | 0.860 | 4.090 | 0.000 | 1.830 | 5.202 |
| C(certification_body)[T.Tanzanian Coffee Board] | 1.6923 | 1.372 | 1.234 | 0.218 | -0.999 | 4.383 |
| C(certification_body)[T.Torch Coffee Lab Yunnan] | 1.9273 | 2.353 | 0.819 | 0.413 | -2.689 | 6.544 |
| C(certification_body)[T.Uganda Coffee Development Authority] | 2.4641 | 0.629 | 3.915 | 0.000 | 1.229 | 3.699 |
| C(certification_body)[T.Yunnan Coffee Exchange] | 2.9481 | 0.984 | 2.997 | 0.003 | 1.018 | 4.878 |
| Omnibus: | 2321.520 | Durbin-Watson: | 0.866 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3642897.434 |
| Skew: | -11.258 | Prob(JB): | 0.00 |
| Kurtosis: | 257.534 | Cond. No. | 38.7 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Preliminary Analysis¶
With an R² value of only 0.122, the certification body explains little of the variance in total_cup_points; although the overall F-test is statistically significant, the variable's predictive value is limited.
Narrowing Categories¶
Let's see whether restricting the data to the top few certification bodies reveals a clearer difference.
certFreq = qBasedCoffee['certification_body'].value_counts()
certFreq = certFreq.to_frame()
certFreq
| count | |
|---|---|
| certification_body | |
| Specialty Coffee Association | 313 |
| AMECAFE | 205 |
| Almacafé | 178 |
| Asociacion Nacional Del Café | 155 |
| Brazil Specialty Coffee Association | 67 |
| Instituto Hondureño del Café | 60 |
| Blossom Valley International | 58 |
| Africa Fine Coffee Association | 49 |
| Specialty Coffee Association of Costa Rica | 43 |
| NUCOFFEE | 36 |
| Uganda Coffee Development Authority | 32 |
| Kenya Coffee Traders Association | 22 |
| Ethiopia Commodity Exchange | 18 |
| Specialty Coffee Institute of Asia | 16 |
| METAD Agricultural Development plc | 15 |
| Yunnan Coffee Exchange | 12 |
| Salvadoran Coffee Council | 11 |
| Specialty Coffee Association of Indonesia | 10 |
| Centro Agroecológico del Café A.C. | 8 |
| Asociación de Cafés Especiales de Nicaragua | 8 |
| Coffee Quality Institute | 7 |
| Asociación Mexicana De Cafés y Cafeterías De Especialidad A.C. | 6 |
| Tanzanian Coffee Board | 6 |
| Torch Coffee Lab Yunnan | 2 |
| Central De Organizaciones Productoras De Café y Cacao Del Perú - Central Café & Cacao | 1 |
| Blossom Valley International\r\n | 1 |
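The counts above show "Blossom Valley International" appearing twice, once with trailing whitespace (`\r\n`), which also produced a duplicate dummy level in the regression earlier. A minimal sketch of merging such near-duplicate categories by stripping whitespace (toy values standing in for the certification_body column):

```python
import pandas as pd

# Toy stand-in showing a near-duplicate category caused by whitespace
cb = pd.Series([
    'Blossom Valley International',
    'Blossom Valley International\r\n',
    'Almacafé',
])

# str.strip() collapses the two variants into a single category
cleaned = cb.str.strip()
print(cleaned.value_counts())
```

Applying the same strip to the real column before modeling would keep a single coefficient per certification body.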
topBodies = ('Specialty Coffee Association','AMECAFE','Almacafé','Asociacion Nacional Del Café')
qBasedCoffeeNarrowed = qBasedCoffee.loc[qBasedCoffee['certification_body'].isin(topBodies)]
print(qBasedCoffeeNarrowed)
      total_cup_points            certification_body
2                89.75  Specialty Coffee Association
11               87.92                      Almacafé
12               87.92                      Almacafé
13               87.92  Specialty Coffee Association
15               87.58                      Almacafé
...                ...                           ...
1334             78.75  Specialty Coffee Association
1335             78.08  Specialty Coffee Association
1336             77.17  Specialty Coffee Association
1337             75.08  Specialty Coffee Association
1338             73.75  Specialty Coffee Association

[851 rows x 2 columns]
from statsmodels.formula.api import ols
fit = ols('total_cup_points ~ C(certification_body)', data=qBasedCoffeeNarrowed).fit()
fit.summary()
| Dep. Variable: | total_cup_points | R-squared: | 0.099 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.096 |
| Method: | Least Squares | F-statistic: | 31.05 |
| Date: | Tue, 15 Oct 2024 | Prob (F-statistic): | 4.75e-19 |
| Time: | 01:30:20 | Log-Likelihood: | -2012.4 |
| No. Observations: | 851 | AIC: | 4033. |
| Df Residuals: | 847 | BIC: | 4052. |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | 80.8227 | 0.180 | 448.349 | 0.000 | 80.469 | 81.177 |
| C(certification_body)[T.Almacafé] | 2.5233 | 0.264 | 9.542 | 0.000 | 2.004 | 3.042 |
| C(certification_body)[T.Asociacion Nacional Del Café] | 0.8298 | 0.275 | 3.020 | 0.003 | 0.291 | 1.369 |
| C(certification_body)[T.Specialty Coffee Association] | 1.1459 | 0.232 | 4.941 | 0.000 | 0.691 | 1.601 |
| Omnibus: | 381.208 | Durbin-Watson: | 0.262 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3180.077 |
| Skew: | -1.833 | Prob(JB): | 0.00 |
| Kurtosis: | 11.732 | Cond. No. | 4.98 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Further Analysis¶
After narrowing the certification bodies analyzed, the linear model produced an even lower R² of 0.099. This leads us to believe that this variable is counterproductive to our analysis.
Answering the Research Questions¶
- What metrics from Q Graders have any significant effect on score?
- certification_body, certification_address, and certification_contact are the only Q Grader-related variables in this data set. Only the first is potentially useful; the other two are essentially identifying information for each grader. After some basic linear regression analysis, certification_body does not appear to be helpful.
- Is this category useful?
- This category does not seem to be useful without further information on the Q Graders themselves. We suggest omitting these variables and focusing elsewhere.